10-mlr

Professor Shannon Ellis

2/14/23

Multiple Linear Regression

Q&A

Coming Soon

Course Announcements

Coming soon

Suggested Reading

Introduction to Modern Statistics Chapter 8: Linear Regression with Multiple Predictors

Agenda

  • Multiple Linear Regression
    • Multiple predictors
    • Main vs interaction effects
    • Model comparison
    • Backward selection

Packages & Data

library(tidyverse)
library(tidymodels)

Data: Paris Paintings

pp <- read_csv("https://raw.githubusercontent.com/COGS137/datasets/main/paris_paintings.csv", 
               na = c("n/a", "", "NA")) |> 
  mutate(log_price = log(price))
  • Number of observations: 3393
  • Number of variables: 62

Two numerical predictors

Multiple predictors

  • Response variable: log_price
  • Explanatory variables: Width and height
lin_mod <- linear_reg() |>
  set_engine("lm")

pp_fit <- lin_mod |>
  fit(log_price ~ Width_in + Height_in, data = pp)
tidy(pp_fit)
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   4.77     0.0579      82.4  0       
2 Width_in      0.0269   0.00373      7.22 6.58e-13
3 Height_in    -0.0133   0.00395     -3.36 7.93e- 4

Linear model with multiple predictors

# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   4.77     0.0579      82.4  0       
2 Width_in      0.0269   0.00373      7.22 6.58e-13
3 Height_in    -0.0133   0.00395     -3.36 7.93e- 4


\[\widehat{log\_price} = 4.77 + 0.0269 \times width - 0.0133 \times height\]

❓ How do we interpret this model?

Numerical and categorical predictors

Price, surface area, and living artist

  • Explore the relationship between price of paintings and surface area, conditioned on whether or not the artist is still living
  • First visualize and explore, then model
  • But first, prep the data:
pp <- pp |>
  mutate(artistliving = case_when(artistliving == 0 ~ "Deceased", 
                                  TRUE ~ "Living"))

pp |>
  count(artistliving)
# A tibble: 2 × 2
  artistliving     n
  <chr>        <int>
1 Deceased      2937
2 Living         456

Typical surface area

Typical surface area appears to be less than 1000 square inches (~ 80cm x 80cm). There are very few paintings that have surface area above 5000 square inches.

ggplot(data = pp, aes(x = Surface, fill = artistliving)) +
  geom_histogram(binwidth = 500) + 
  facet_grid(artistliving ~ .) +
  scale_fill_manual(values = c("#E48957", "#071381")) +
  guides(fill = "none") +
  labs(x = "Surface area", y = NULL) +
  geom_vline(xintercept = 1000) +
  geom_vline(xintercept = 5000, linetype = "dashed", color = "gray")

Narrowing the scope

For simplicity let’s focus on the paintings with Surface < 5000:

pp_Surf_lt_5000 <- pp |>
  filter(Surface < 5000)

ggplot(data = pp_Surf_lt_5000, 
       aes(y = log_price, x = Surface, color = artistliving, shape = artistliving)) +
  geom_point(alpha = 0.5) +
  labs(color = "Artist", shape = "Artist") +
  scale_color_manual(values = c("#E48957", "#071381"))

Facet to get a better look

ggplot(data = pp_Surf_lt_5000, 
       aes(y = log_price, x = Surface, color = artistliving, shape = artistliving)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~artistliving) +
  scale_color_manual(values = c("#E48957", "#071381")) +
  labs(color = "Artist", shape = "Artist")

Two ways to model

  • Main effects: Assuming relationship between surface and logged price does not vary by whether or not the artist is living.
  • Interaction effects: Assuming relationship between surface and logged price varies by whether or not the artist is living.

Interacting explanatory variables

  • Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
  • This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.
  • This can be accomplished by adding an interaction variable: the product of two explanatory variables.

Two ways to model

  • Main effects: Assuming relationship between surface and logged price does not vary by whether or not the artist is living
  • Interaction effects: Assuming relationship between surface and logged price varies by whether or not the artist is living

❓ Which does your intuition/knowledge of the data suggest is more appropriate?

Put a green sticky if you think main; pink if you think interaction.

Fit model with main effects

  • Response variable: log_price
  • Explanatory variables: Surface area and artistliving
pp_main_fit <- lin_mod |>
  fit(log_price ~ Surface + artistliving, data = pp_Surf_lt_5000)
tidy(pp_main_fit)
# A tibble: 3 × 5
  term               estimate std.error statistic  p.value
  <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        4.88     0.0424       115.   0       
2 Surface            0.000265 0.0000415      6.39 1.85e-10
3 artistlivingLiving 0.137    0.0970         1.41 1.57e- 1

\[\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times artistliving\]

Solving the model

  • Non-living artist: Plug in 0 for artistliving

\(\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times 0\)
\(= 4.88 + 0.000265 \times surface\)

  • Living artist: Plug in 1 for artistliving

\(\widehat{log\_price} = 4.88 + 0.000265 \times surface + 0.137 \times 1\)
\(= 5.017 + 0.000265 \times surface\)

Visualizing main effects

  • Same slope: Rate of change in price as the surface area increases does not vary between paintings by living and non-living artists.
  • Different intercept: Paintings by living artists are consistently more expensive than paintings by non-living artists.

Interpreting main effects

tidy(pp_main_fit) |> 
  mutate(exp_estimate = exp(estimate)) |>
  select(term, estimate, exp_estimate)
# A tibble: 3 × 3
  term               estimate exp_estimate
  <chr>                 <dbl>        <dbl>
1 (Intercept)        4.88           132.  
2 Surface            0.000265         1.00
3 artistlivingLiving 0.137            1.15
  • All else held constant, for each additional square inch in painting’s surface area, the price of the painting is predicted, on average, to be higher by a factor of 1.
  • All else held constant, paintings by a living artist are predicted, on average, to be higher by a factor of 1.15 compared to paintings by an artist who is no longer alive.
  • Paintings that are by an artist who is not alive and that have a surface area of 0 square inches are predicted, on average, to be 132 livres.

Main vs. interaction effects

  • The way we specified our main effects model only lets artistliving affect the intercept.
  • Model implicitly assumes that paintings with living and deceased artists have the same slope and only allows for different intercepts.

❓ What seems more appropriate in this case?

  • Same slope and same intercept for both colors
  • Same slope and different intercept for both colors
  • Different slope and different intercept for both colors

Interaction: Surface * artistliving

Fit model with interaction effects

  • Response variable: log_price
  • Explanatory variables: Surface area, artistliving, and their interaction
pp_int_fit <- lin_mod |>
  fit(log_price ~ Surface * artistliving, data = pp_Surf_lt_5000)
tidy(pp_int_fit)
# A tibble: 4 × 5
  term                        estimate std.error statistic    p.value
  <chr>                          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)                 4.91     0.0432       114.   0         
2 Surface                     0.000206 0.0000442      4.65 0.00000337
3 artistlivingLiving         -0.126    0.119         -1.06 0.289     
4 Surface:artistlivingLiving  0.000479 0.000126       3.81 0.000139  

Linear model with interaction effects

# A tibble: 4 × 5
  term                        estimate std.error statistic    p.value
  <chr>                          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)                 4.91     0.0432       114.   0         
2 Surface                     0.000206 0.0000442      4.65 0.00000337
3 artistlivingLiving         -0.126    0.119         -1.06 0.289     
4 Surface:artistlivingLiving  0.000479 0.000126       3.81 0.000139  

\[\widehat{log\_price} = 4.91 + 0.00021 \times surface - 0.126 \times artistliving\] \[+ ~ 0.00048 \times surface * artistliving\]

Interpretation of interaction effects

  • Rate of change in price as the surface area of the painting increases does vary between paintings by living and non-living artists (different slopes)
  • Some paintings by living artists are more expensive than paintings by non-living artists, and some are not (different intercept).
  • Non-living artist: \(\widehat{log\_price} = 4.91 + 0.00021 \times surface\) \(- 0.126 \times 0 + 0.00048 \times surface \times 0\) \(= 4.91 + 0.00021 \times surface\)
  • Living artist: \(\widehat{log\_price} = 4.91 + 0.00021 \times surface\) \(- 0.126 \times 1 + 0.00048 \times surface \times 1\) \(= 4.91 + 0.00021 \times surface\) \(- 0.126 + 0.00048 \times surface\) \(= 4.784 + 0.00069 \times surface\)

Comparing models

R-squared

  • \(R^2\) is the percentage of variability in the response variable explained by the regression model.
glance(pp_main_fit)$r.squared
[1] 0.01320884
glance(pp_int_fit)$r.squared
[1] 0.0176922
  • Clearly the model with interactions has a higher \(R^2\).
  • However using \(R^2\) for model selection in models with multiple explanatory variables is not a good idea as \(R^2\) increases when any variable is added to the model.

Adjusted R-squared

It appears that adding the interaction actually increased adjusted \(R^2\), so we should indeed use the model with the interactions.

glance(pp_main_fit)$adj.r.squared
[1] 0.01258977
glance(pp_int_fit)$adj.r.squared
[1] 0.01676753

Third order interactions

  • Can you? Yes
  • Should you? Probably not if you want to interpret these interactions in context of the data.

In pursuit of Occam’s razor

  • Occam’s Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.

  • Model selection follows this principle.

  • We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.

  • In other words, we prefer the simplest best model, i.e. parsimonious model.

Backward selection

For this demo, we’ll ignore interaction effects…and just model main effects to start:

pp_full <-  lin_mod |>
  fit(log_price ~ Width_in + Height_in + Surface + artistliving, data=pp) 

glance(pp_full)$adj.r.squared
[1] 0.02570141
  • \(R^2\) (full): 0.0257014

Remove artistliving

pp_noartist <- lin_mod |>
  fit(log_price ~ Width_in + Height_in + Surface, data=pp) 

glance(pp_noartist)$adj.r.squared
[1] 0.02579859
  • \(R^2\) (full): 0.0257
  • \(R^2\) (no artistliving): 0.0258

…Improved variance explained

Remove Surface

pp_noartist_nosurface <- lin_mod |>
  fit(log_price ~ Width_in + Height_in, data=pp) 

glance(pp_noartist_nosurface)$adj.r.squared
[1] 0.02231559
  • \(R^2\) (full): 0.0257
  • \(R^2\) (no artistliving): 0.0258
  • \(R^2\) (no artistliving or Surface): 0.0223

…no longer gaining improvement, so we stick with: log_price ~ Width_in + Height_in + Surface

Other approach:

# requires package installation: 
# install.packages("olsrr")
library(olsrr)

Step 1: Fit model (w/o tidymodels)

# fit the model (not using tidymodels)
mod <- lm(log_price ~ Width_in + Height_in + Surface + artistliving, data=pp_Surf_lt_5000)

Step 2: Determine which variables to remove

ols_step_backward_p(mod)

                             Elimination Summary                               
------------------------------------------------------------------------------
        Variable                      Adj.                                        
Step      Removed       R-Square    R-Square     C(p)        AIC         RMSE     
------------------------------------------------------------------------------
   1    artistliving      0.0261      0.0251    3.8495    12603.7727    1.8315    
------------------------------------------------------------------------------

…specifies that artistliving should be removed

Step 2 (alternate): Compare all possible models…

ols_step_all_possible(mod) |>
  arrange(desc(adjr))
   Index N                              Predictors     R-Square Adj. R-Square
1     11 3              Width_in Height_in Surface 0.0260749939  0.0251349118
2     15 4 Width_in Height_in Surface artistliving 0.0263412027  0.0250876993
3      5 2                      Width_in Height_in 0.0256902566  0.0250634893
4     12 3         Width_in Height_in artistliving 0.0259732581  0.0250330779
5      6 2                        Width_in Surface 0.0249136264  0.0242863596
6     13 3           Width_in Surface artistliving 0.0251787948  0.0242378477
7      7 2                   Width_in artistliving 0.0212864021  0.0206568018
8      1 1                                Width_in 0.0209415833  0.0206267736
9      8 2                    Surface artistliving 0.0132088377  0.0125897717
10     2 1                                 Surface 0.0125899681  0.0122803381
11    14 3          Height_in Surface artistliving 0.0130836930  0.0121310711
12     9 2                       Height_in Surface 0.0126782901  0.0120431523
13    10 2                  Height_in artistliving 0.0062698155  0.0056305552
14     3 1                               Height_in 0.0058797727  0.0055601199
15     4 1                            artistliving 0.0005531617  0.0002397573
   Mallow's Cp
1     3.849487
2     5.000000
3     3.077206
4     4.174132
5     5.555476
6     6.709309
7    17.130153
8    16.230489
9    51.971190
10   52.001268
11   45.305459
12   44.599123
13   65.048926
14   64.293574
15   91.485605

Recap

  • Can you model and interpret linear models with multiple predictors?
  • Can you explain the difference in a model with main effects vs. interaction effects?
  • Can you compare different models and determine how to proceed?
  • Can you carry out and explain backward selection?